NETOBSERV-1061: Add TCP drop and DNS tracking hooks #115

Merged
merged 4 commits into from
Jun 24, 2023

@msherif1234 msherif1234 commented Apr 26, 2023

  • Add a kfree_skb tracepoint hook to detect when TCP packets are dropped, and update the flow metrics with TCP socket info.

  • To view the attached eBPF tracepoints with bpftool:

bpftool perf show
pid 119418  fd 8: prog_id 134  tracepoint  kfree_skb

  • Test using ncat
    ===============
server: ncat -l 8080
client: ncat 192.168.122.69 8080

  • Use an iptables rule to drop the traffic
    ========================================
sudo iptables -I INPUT 1 -m tcp --proto tcp --dst 192.168.122.69/32 --dport 8080 -j DROP
To switch back, insert an ACCEPT rule ahead of the DROP rule:
sudo iptables -I INPUT 1 -m tcp --proto tcp --dst 192.168.122.69/32 --dport 8080 -j ACCEPT

Once the iptables DROP rule is installed, the tcp_drops stats are updated:

 },{
                "cpu": 2,
                "value": {
                    "packets": 1,
                    "bytes": 67,
                    "start_mono_time_ts": 5481943029602,
                    "end_mono_time_ts": 5481943029602,
                    "flags": 16,
                    "errno": 0,
                    "tcp_drops": {
                        "packets": 14,
                        "bytes": 800,
                        "flags": 16,
                        "state": 1,
                        "drop_cause": 8  <<<<< SKB_DROP_REASON_NETFILTER_DROP
                    }
                }	
            }
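
The numeric fields in the dump above can be read more easily once decoded. The following is a minimal sketch, assuming that the "flags" field carries the raw TCP header flag bits in its low 8 bits and that "state" uses the Linux TCP socket state numbering; the function names decodeTCPFlags and tcpStateName are hypothetical and are not taken from the agent's code.

```go
package main

import (
	"fmt"
	"strings"
)

// TCP header flag bits (low 8 bits of the TCP flags field).
var tcpFlagNames = []struct {
	bit  uint16
	name string
}{
	{0x01, "FIN"}, {0x02, "SYN"}, {0x04, "RST"}, {0x08, "PSH"},
	{0x10, "ACK"}, {0x20, "URG"}, {0x40, "ECE"}, {0x80, "CWR"},
}

// Linux TCP socket states (include/net/tcp_states.h).
var tcpStateNames = map[uint8]string{
	1: "TCP_ESTABLISHED", 2: "TCP_SYN_SENT", 3: "TCP_SYN_RECV",
	4: "TCP_FIN_WAIT1", 5: "TCP_FIN_WAIT2", 6: "TCP_TIME_WAIT",
	7: "TCP_CLOSE", 8: "TCP_CLOSE_WAIT", 9: "TCP_LAST_ACK",
	10: "TCP_LISTEN", 11: "TCP_CLOSING", 12: "TCP_NEW_SYN_RECV",
}

// decodeTCPFlags (hypothetical helper) lists the names of the set flag bits.
// Assumption: the low 8 bits are raw TCP header flags.
func decodeTCPFlags(flags uint16) string {
	var names []string
	for _, f := range tcpFlagNames {
		if flags&f.bit != 0 {
			names = append(names, f.name)
		}
	}
	if len(names) == 0 {
		return "none"
	}
	return strings.Join(names, "|")
}

// tcpStateName (hypothetical helper) maps a kernel TCP state number to its name.
func tcpStateName(state uint8) string {
	if name, ok := tcpStateNames[state]; ok {
		return name
	}
	return fmt.Sprintf("UNKNOWN(%d)", state)
}

func main() {
	// Values taken from the tcp_drops object in the dump above.
	fmt.Println("flags:", decodeTCPFlags(16))
	fmt.Println("state:", tcpStateName(1))
}
```

Under these assumptions the sample record describes ACK segments dropped on an established connection. The drop_cause value is left undecoded here because the numbering of the kernel's drop reason enum may differ between kernel versions.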
  • Add a net_dev_queue tracepoint hook to implement a lightweight DNS tracker.
  • Tested using dig google.com; the DNS id in this run was 23099.

Query
 "dns_record": {
                "id": 23099,
                "flags": 288,
                "req_mono_time_ts": 3559391813403,
                "rsp_mono_time_ts": 0
}

Response
 "dns_record": {
                "id": 23099,
                "flags": 33152,
                "req_mono_time_ts": 0,
                "rsp_mono_time_ts": 3559392215340
}
Latency: about 0.4 ms (rsp_mono_time_ts - req_mono_time_ts = 401937 ns)
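
The two records above can be correlated by DNS id to derive the latency and to tell the query from the response. This is a minimal sketch, assuming the "flags" value is the 16-bit DNS header flags word in host byte order; the type dnsFlags and the functions parseDNSFlags and dnsLatency are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// dnsFlags holds a few fields of the DNS header flags word (RFC 1035).
// The type name is hypothetical.
type dnsFlags struct {
	Response           bool  // QR bit: false = query, true = response
	Opcode             uint8 // 0 = standard query
	RecursionDesired   bool  // RD bit
	RecursionAvailable bool  // RA bit
	Rcode              uint8 // 0 = no error
}

// parseDNSFlags splits the 16-bit flags word into its fields.
// Assumption: the value is already in host byte order.
func parseDNSFlags(f uint16) dnsFlags {
	return dnsFlags{
		Response:           f&0x8000 != 0,
		Opcode:             uint8((f >> 11) & 0xF),
		RecursionDesired:   f&0x0100 != 0,
		RecursionAvailable: f&0x0080 != 0,
		Rcode:              uint8(f & 0xF),
	}
}

// dnsLatency returns the time between request and response, or 0 when
// either timestamp is missing or the pair is out of order.
func dnsLatency(reqNs, rspNs uint64) time.Duration {
	if reqNs == 0 || rspNs < reqNs {
		return 0
	}
	return time.Duration(rspNs - reqNs)
}

func main() {
	// Values taken from the two dns_record samples with id 23099.
	fmt.Printf("query flags:    %+v\n", parseDNSFlags(288))
	fmt.Printf("response flags: %+v\n", parseDNSFlags(33152))
	fmt.Println("latency:", dnsLatency(3559391813403, 3559392215340))
}
```

With the sample values, 288 decodes as a standard query with recursion desired, 33152 as a response with recursion available and no error, and the timestamp difference is 401937 ns.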

NOTE:

Related PRs:
netobserv/network-observability-operator#331
netobserv/flowlogs-pipeline#429
netobserv/network-observability-console-plugin#324

codecov bot commented Apr 26, 2023

Codecov Report

Merging #115 (2cffd02) into main (2d63d90) will decrease coverage by 0.38%.
The diff coverage is 44.66%.

@@            Coverage Diff             @@
##             main     #115      +/-   ##
==========================================
- Coverage   40.48%   40.11%   -0.38%     
==========================================
  Files          31       31              
  Lines        2060     2124      +64     
==========================================
+ Hits          834      852      +18     
- Misses       1186     1227      +41     
- Partials       40       45       +5     
Flag        Coverage Δ
unittests   40.11% <44.66%> (-0.38%) ⬇️

Impacted Files           Coverage Δ
pkg/agent/agent.go       39.20% <0.00%> (ø)
pkg/ebpf/tracer.go       0.00% <0.00%> (ø)
pkg/exporter/proto.go    82.73% <62.06%> (-17.27%) ⬇️
pkg/flow/record.go       71.42% <66.66%> (-4.58%) ⬇️
pkg/flow/account.go      82.69% <100.00%> (ø)
pkg/flow/tracer_map.go   79.41% <100.00%> (ø)

@msherif1234 msherif1234 changed the title Add TCP drop hook and update flows metrics WIP: Add TCP drop hook and update flows metrics Apr 26, 2023
@msherif1234 msherif1234 changed the title WIP: Add TCP drop hook and update flows metrics WIP: NETOBSERV-979: Add TCP drop hook and update flows metrics Apr 26, 2023
@msherif1234 msherif1234 changed the title WIP: NETOBSERV-979: Add TCP drop hook and update flows metrics NETOBSERV-979: Add TCP drop hook and update flows metrics Apr 27, 2023

openshift-ci-robot commented Apr 27, 2023

@msherif1234: This pull request references NETOBSERV-979 which is a valid jira issue.

In response to this:

Add a kfree_skb tracepoint hook to detect when TCP packets are dropped, and update the flow metrics with TCP socket info.

To view ebpf tracepoints with bpftool

bpftool perf show
pid 119418  fd 8: prog_id 134  tracepoint  kfree_skb

For background on tcpdrop, see https://www.brendangregg.com/blog/2018-05-31/linux-tcpdrop.html

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@msherif1234
Contributor Author

/assign @praveingk @jotak @jpinsonneau @ronensc

@msherif1234 msherif1234 added the enhancement New feature or request label Apr 27, 2023
@msherif1234 msherif1234 changed the title NETOBSERV-979: Add TCP drop hook and update flows metrics NETOBSERV-979: Add TCP drop hook and update flows metric Apr 27, 2023
@jpinsonneau jpinsonneau added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Apr 28, 2023
@github-actions

New image: ["quay.io/netobserv/netobserv-ebpf-agent:8197b19"]. It will expire after two weeks.

@jpinsonneau
Contributor

@msherif1234 I tried to deploy this on a clusterbot using launch 4.13 aws,large but got the following error:

time="2023-04-28T07:46:46Z" level=debug msg="agent IP: 10.0.149.138" component=agent.Flows
time="2023-04-28T07:46:47Z" level=info msg="failed to attach the BPF program to kfree_skb tracepoint: reading file \"/sys/kernel/debug/tracing/events/skb/kfree_skb/id\": open /sys/kernel/debug/tracing/events/skb/kfree_skb/id: no such file or directory" component=ebpf.FlowFetcher
time="2023-04-28T07:46:47Z" level=fatal msg="can't instantiate NetObserv eBPF Agent" error="reading file \"/sys/kernel/debug/tracing/events/skb/kfree_skb/id\": open /sys/kernel/debug/tracing/events/skb/kfree_skb/id: no such file or directory"

I thought using the privileged option in the CRD would solve it, but it does not 😿

apiVersion: flows.netobserv.io/v1beta1
kind: FlowCollector
...
spec:
  agent:
    ebpf:
...
      privileged: true
...

Am I missing something?

@msherif1234
Contributor Author

Yes. Similar to what I added to the e2e manifests, we need to add the same volume mount to the CRD.
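
The failure in the log above comes from the tracepoint id file not being visible inside the container. As a minimal sketch of how that condition could be checked before attaching, assuming the tracing directory is expected at /sys/kernel/debug/tracing as in the error message; the helper names tracepointIDPath and tracepointAvailable are hypothetical and not part of the agent.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// tracefsRoot is the location used in the error message above.
// Assumption: the host's debugfs is mounted into the container at this path.
const tracefsRoot = "/sys/kernel/debug/tracing"

// tracepointIDPath (hypothetical helper) builds the path of the id file
// for a tracepoint, e.g. group "skb" and name "kfree_skb".
func tracepointIDPath(root, group, name string) string {
	return filepath.Join(root, "events", group, name, "id")
}

// tracepointAvailable (hypothetical helper) reports whether the id file
// can be found, wrapping the underlying error when it cannot.
func tracepointAvailable(root, group, name string) error {
	p := tracepointIDPath(root, group, name)
	if _, err := os.Stat(p); err != nil {
		return fmt.Errorf("tracepoint %s/%s not reachable at %s: %w", group, name, p, err)
	}
	return nil
}

func main() {
	if err := tracepointAvailable(tracefsRoot, "skb", "kfree_skb"); err != nil {
		fmt.Println("cannot attach:", err)
		return
	}
	fmt.Println("tracepoint id file found")
}
```

A check like this only reports the problem earlier; the id file still has to be made visible to the pod through the volume mount mentioned above.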

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Apr 28, 2023

openshift-ci-robot commented Apr 28, 2023

@msherif1234: This pull request references NETOBSERV-979 which is a valid jira issue.

In response to this:

  • Add a kfree_skb tracepoint hook to detect when TCP packets are dropped, and update the flow metrics with TCP socket info.

  • To view the attached eBPF tracepoints with bpftool:

bpftool perf show
pid 119418  fd 8: prog_id 134  tracepoint  kfree_skb

Test using ncat
===============
server: ncat -l 8080
client: ncat 192.168.122.69 8080

Use an iptables rule to drop the traffic
========================================
sudo iptables -I INPUT 1 -m tcp --proto tcp --dst 192.168.122.69/32 --dport 8080 -j DROP
To switch back, insert an ACCEPT rule ahead of the DROP rule:
sudo iptables -I INPUT 1 -m tcp --proto tcp --dst 192.168.122.69/32 --dport 8080 -j ACCEPT

Run the agent locally with a long active timeout (2 minutes) so flows stay longer
=================================================================================
sudo FLOWS_TARGET_HOST=127.0.0.1 FLOWS_TARGET_PORT=9999 CACHE_ACTIVE_TIMEOUT="120s" ./bin/netobserv-ebpf-agent

Use bpftool to find the map and dump it (500 is the map id in this run)
=======================================================================
sudo bpftool map list
sudo bpftool map dump id 500 | grep dst_port

Use bpftool to look at program traces
=====================================
sudo bpftool prog tracelog

Use bpftool to check that the tracepoint is attached
====================================================
bpftool perf show

Once the iptables DROP rule is installed, the tcp_drops stats are updated:

},{
               "cpu": 10,
               "value": {
                   "packets": 7,
                   "bytes": 540,
                   "start_mono_time_ts": 1047585604832,
                   "end_mono_time_ts": 1054159237862,
                   "flags": 16,
                   "errno": 0,
                   "tcp_drops": {
                       "packets": 7,
                       "bytes": 442,
                       "flags": 16,
                       "state": 1
                   }
               }
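
Each entry in the dump carries a "cpu" field, which suggests the values are stored per CPU and need to be combined in user space. The following is a minimal sketch of such a merge, assuming counters are summed, flags are OR-ed, and the time window is widened to cover all entries; the flowMetrics type, its field names, and the aggregate function are hypothetical and do not claim to match the agent's implementation.

```go
package main

import "fmt"

// flowMetrics mirrors a subset of the fields visible in the bpftool dump.
// The Go type and its field names are hypothetical.
type flowMetrics struct {
	Packets         uint32
	Bytes           uint64
	StartMonoTimeNs uint64
	EndMonoTimeNs   uint64
	Flags           uint16
	DropPackets     uint32
	DropBytes       uint64
}

// aggregate merges per-CPU entries into one record: counters are summed,
// flags are OR-ed, and the time window spans the earliest start and the
// latest end. Zeroed entries (CPUs that never saw the flow) do not affect
// the start time.
func aggregate(perCPU []flowMetrics) flowMetrics {
	var total flowMetrics
	for _, m := range perCPU {
		total.Packets += m.Packets
		total.Bytes += m.Bytes
		total.DropPackets += m.DropPackets
		total.DropBytes += m.DropBytes
		total.Flags |= m.Flags
		if m.StartMonoTimeNs != 0 &&
			(total.StartMonoTimeNs == 0 || m.StartMonoTimeNs < total.StartMonoTimeNs) {
			total.StartMonoTimeNs = m.StartMonoTimeNs
		}
		if m.EndMonoTimeNs > total.EndMonoTimeNs {
			total.EndMonoTimeNs = m.EndMonoTimeNs
		}
	}
	return total
}

func main() {
	// Illustrative values only, not taken from a real dump.
	perCPU := []flowMetrics{
		{Packets: 1, Bytes: 67, StartMonoTimeNs: 200, EndMonoTimeNs: 200, Flags: 0x10, DropPackets: 2, DropBytes: 100},
		{}, // a CPU that never saw the flow
		{Packets: 7, Bytes: 540, StartMonoTimeNs: 100, EndMonoTimeNs: 300, Flags: 0x02, DropPackets: 5, DropBytes: 342},
	}
	fmt.Printf("%+v\n", aggregate(perCPU))
}
```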

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Apr 28, 2023
@openshift-ci openshift-ci bot removed the lgtm label Jun 20, 2023
@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 20, 2023
@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 20, 2023
@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 20, 2023
@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 20, 2023
@github-actions

New image: quay.io/netobserv/netobserv-ebpf-agent:5aac091. It will expire after two weeks.

@jotak
Member

jotak commented Jun 21, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jun 21, 2023
bpf/utils.h (outdated)
    } else if (id->eth_protocol == ETH_P_IPV6) {
        *family = AF_INET6;
    }
}
Contributor

Not urgent for this commit, but there is functionality repeated between these functions and the fill_** functions, so maybe we should look to consolidate them.

Contributor Author

sure we can do so

@openshift-ci openshift-ci bot removed the lgtm label Jun 22, 2023
@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 22, 2023
Signed-off-by: msherif1234 <mmahmoud@redhat.com>
Signed-off-by: msherif1234 <mmahmoud@redhat.com>
Signed-off-by: msherif1234 <mmahmoud@redhat.com>
fix lint errors
flatten icmp block

Signed-off-by: msherif1234 <mmahmoud@redhat.com>
@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 23, 2023
@github-actions

New image: quay.io/netobserv/netobserv-ebpf-agent:cfcda6b. It will expire after two weeks.

Contributor

@dushyantbehl dushyantbehl left a comment

/lgtm

@msherif1234
Contributor Author

/approve

@openshift-ci

openshift-ci bot commented Jun 24, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msherif1234

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 27baf29 into netobserv:main Jun 24, 2023
@jpinsonneau jpinsonneau added the breaking-change This pull request has breaking changes. They should be described in PR description. label Jun 28, 2023
Labels
approved breaking-change This pull request has breaking changes. They should be described in PR description. enhancement New feature or request jira/valid-reference lgtm ok-to-test To set manually when a PR is safe to test. Triggers image build on PR.